ROB: Relax flate decoding for too many lookup values #2331
Codecov Report

Coverage diff, main vs. #2331:

| Metric | main | #2331 | +/- |
|---|---|---|---|
| Coverage | 94.37% | 94.32% | -0.05% |
| Files | 43 | 43 | |
| Lines | 7660 | 7666 | +6 |
| Branches | 1515 | 1518 | +3 |
| Hits | 7229 | 7231 | +2 |
| Misses | 267 | 269 | +2 |
| Partials | 164 | 166 | +2 |
Example: We have the lookup table |
Example document: out1.pdf |
Looks good!
Now we just need to improve test coverage😉
In the next few days, I might have a look at how to emulate the two remaining error cases for which we do not have actual example files yet.
@stefan6419846 Please correct me if I'm wrong, but I don't think it's a bugfix-PR. A bug in pypdf would mean that pypdf is either...
This PR has two components:
As robustness improvements (ROB) are typically more important for pypdf users, I'd start the PR with "ROB". See https://pypdf.readthedocs.io/en/latest/dev/intro.html (I should probably extend those :-) ) |
From my side this looks good to be merged 👍 Just let me know if I can change the PR title |
@MartinThoma No worries, I am completely fine with you adjusting the title if required, referring to robustness instead. Adjusting the dev docs sounds like a nice idea as well to make this clearer. Feel free to merge this PR after updating the title. If I find some time during the next week, I might provide some more test data for the two new edge cases in a separate PR to increase coverage. So far, I have not stumbled upon such files, as in theory they violate the PDF standard anyway, while the whitespace case seems more spec-like and thus appears in the real world as well.
## What's new

### Bug Fixes (BUG)
- Cope with deflated images with CMYK Black Only (#2322) by @pubpub-zz
- Handle indirect objects as parameters for CCITTFaxDecode (#2307) by @stefan6419846
- check words length in _cmap type1_alternative function (#2310) by @Takher

### Robustness (ROB)
- Relax flate decoding for too many lookup values (#2331) by @stefan6419846
- Let _build_destination skip in case of missing /D key (#2018) by @nickryand

### Documentation (DOC)
- Note in reading form data (#2338) by @MartinThoma
- Pull Request prefixes and size by @MartinThoma
- Add https://github.com/zuypt for #2325 as a contributor by @MartinThoma
- Fix docstring for RunLengthDecode.decode (#2302) by @stefan6419846

### Maintenance (MAINT)
- Enable `disallow_any_generics` and add missing generics (#2278) by @nilehmann

### Testing (TST)
- Centralize file downloads (#2324) by @MartinThoma

### Code Style (STY)
- Fix typo "steam" → "stream" (#2327) by @stefan6419846
- Run black by @MartinThoma
- Make Traceback in bug report template uppercase (#2304) by @stefan6419846

[Full Changelog](3.17.1...3.17.2)
As mentioned in #2331, this will improve the test coverage for the edge cases. Further refactoring was necessary as iterating over bytes will yield integers instead of single bytes and thus the whitespace check has been broken. Additionally, the whitespace check has previously always been performed on the shortened bytes data.
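The integer-versus-bytes pitfall mentioned above can be illustrated with a minimal sketch. The function names and the whitespace constant here are hypothetical and not taken from pypdf; the snippet only shows why comparing the items of a `bytes` object against single-byte `bytes` values never matches.

```python
# The six whitespace characters defined by the PDF specification:
# NUL, TAB, LF, FF, CR and SPACE.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def is_whitespace_broken(data: bytes) -> bool:
    """Hypothetical broken check: iterating over bytes yields ints."""
    # Each `value` is an int such as 32, which never equals b" ".
    return all(value in (b" ", b"\n", b"\r", b"\t") for value in data)


def is_whitespace_fixed(data: bytes) -> bool:
    """Hypothetical fixed check: test int membership in a bytes object."""
    # `32 in b" \n"` is True, because bytes containment accepts ints.
    return all(value in PDF_WHITESPACE for value in data)


trailing = b" \n"
print(is_whitespace_broken(trailing))  # False
print(is_whitespace_fixed(trailing))   # True
```

An alternative would be to slice instead of iterate (`data[i:i + 1]`), which keeps each item a `bytes` object of length one.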
When handling flate objects with a lookup table and the image mode `1`, we would previously raise a generic `AssertionError` if the number of lookup values did not match.

This PR proposes to add a more meaningful error message. Additionally, cases where too many values are specified are now considered a warning only as I could not see any real difference.
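The behaviour described above could be sketched roughly as follows. This is not the actual pypdf implementation: the function name `check_lookup`, the use of `ValueError` and `warnings.warn`, and the assumption that mode `1` expects two palette entries of `components` bytes each are all illustrative choices.

```python
import warnings


def check_lookup(lookup: bytes, components: int) -> bytes:
    """Hypothetical helper: validate a lookup table for image mode "1".

    Assumption: a 1-bit image needs exactly two palette entries,
    each consisting of `components` bytes.
    """
    expected = 2 * components
    if len(lookup) < expected:
        # Too few values cannot be recovered from, so fail with a
        # descriptive message instead of a bare AssertionError.
        raise ValueError(
            f"Not enough lookup values: expected {expected}, got {len(lookup)}."
        )
    if len(lookup) > expected:
        # Too many values are tolerated: warn and drop the surplus.
        warnings.warn(
            f"Too many lookup values: expected {expected}, got {len(lookup)}."
        )
        lookup = lookup[:expected]
    return lookup


# Usage: one surplus byte triggers a warning and is cut off.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = check_lookup(b"\x00\xff\n", components=1)
print(result, len(caught))
```

Raising only for the unrecoverable case keeps strictness where data is genuinely missing, while surplus bytes are simply ignored after a warning.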
I might try to find a document version I can use for public test cases later on to at least cover the case where there are too many values.